DIVERSITY OF VOCABULARY AND THE HARMONIC
SERIES LAW OF WORD-FREQUENCY
DISTRIBUTION**


By J. B. Carroll
University of Minnesota 


Samples of the speech or writing of individuals varying in 
age, intelligence, and background will be found to differ in 
what may be termed diversity, i.e., the relative amount of 
repetitiveness or the relative variety in vocabulary. Of two 
samples of equal length the one of low diversity has fewer 
different words, most of them common; the sample of higher 
diversity contains a greater number of different words, so 
that each word has a lower frequency. Some index of 
diversity is obviously desirable. A simple expedient would
be to take the mean number of different words in samples
of a standard size (say, 1,000 words), but since we cannot
always obtain this information (where, for example, we know
only the number of different words in some larger or smaller
sample), it is desirable to make use of what may be called
the diversity curve, which shows the relation between the
number of words in the sample (N) and the number of
different words (d).* A constant in the equation of this curve
provides a useful index of diversity.

An attempt is made here to derive an equation for the 
diversity curve on the basis of certain hypotheses which 
have been empirically confirmed to a considerable extent in 
previous studies. E. U. Condon’s early formulation of the
harmonic series hypothesis (1) was inadequate because in
effect it made diversity a function of the size of the sample.
G. K. Zipf’s essentially similar formulation (11, p. 45)
provides that the most frequent word constitutes 1/10 of all
the words in a sample; that the next most frequent word
constitutes 1/20 of all the words in the sample; the next most
frequent, 1/30 of all the words, etc. The harmonic series
hypothesis also states that for each word in a given sample


*For our purposes every linguistic form is counted as a separate 
word, e.g., do, does, did are all discrete entities. 


** Recommended for publication by Dr. B. F. Skinner, Nov. 10, 1938. 
$f r^x = K$, where K and x are empirical constants for the
sample; f is the number of times a given word occurs in the
sample; and r stands for the rank of a word in the given sample,
the ranks being in order of descending frequency. The rank
assigned to words of equal frequency is an average. In most
cases x = 1, though it may vary slightly with the size of
the sample. For the present it may be assumed to take the
value 1.00. Zipf has further recognized (12) that account
may be taken of diversity (Zipf’s “average rate of
repetitiveness”) by assuming that the word of rank r will occur on
the average once in every kr words, where k is an empirical
constant whose value is normally 10, but varies inversely
with the “rate of repetitiveness.” For us k may serve as a
direct index of diversity. In a sample of N words, the word
of rank r will occur $\frac{N}{kr}$ times. That is, $f = \frac{N}{kr}$, or
$fr = \frac{N}{k} = K$.
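As a rough illustration of the rank-frequency relation just stated, the sketch below tabulates the expected frequency N/(kr) for a few ranks. The function name and the sample values of N and k are illustrative assumptions, not figures taken from the study.

```python
def expected_frequency(n_words, k, rank):
    """Expected occurrences of the word of a given rank: f = N / (k * r)."""
    return n_words / (k * rank)


if __name__ == "__main__":
    # Hypothetical sample: N = 8,000 words with Zipf's "normal" k = 10.
    n_words, k = 8000, 10.0
    for rank in (1, 2, 3, 10, 800):
        print(rank, expected_frequency(n_words, k, rank))
```

With these assumed values the word of rank N/k = 800 is the last one expected to occur at least once, which is the boundary used in the derivation that follows.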


The number of different words in a sample of N words may
then be derived as a function of N as follows:
If the statistical probability of the occurrence of word r is
$\frac{1}{kr}$, then theoretically there will be a number of words which
will certainly occur in a sample of size N, viz., all the words
from $r = 1$ to $r = \frac{N}{k}$; these will theoretically occur one or
more times. In samples of moderate size, these $\frac{N}{k}$ words
will not make up the total sample, and the remainder of the
sample is hence made up of words whose statistical chances
of occurring are less than $\frac{1}{N}$, i.e., where $f = \frac{N}{kr} < 1$. Since
these words would occur in a larger sample, their occurrences
in a smaller sample are fortuitous. Hence this residue
is (N minus the total frequencies of the first $\frac{N}{k}$ words), or

$$N - \sum_{r=1}^{N/k} \frac{N}{kr} \qquad (1)$$

where the residue has one term corresponding to each different word.
Then the number of different words in the sample, including
the first $\frac{N}{k}$ words, is

$$d = \frac{N}{k} + N - \sum_{r=1}^{N/k} \frac{N}{kr} \qquad (2)$$

Since $\sum_{r=1}^{N/k} \frac{1}{r}$ in the last term is approximately*
$\left(0.577 + \log_e \frac{N}{k}\right)$ when $\frac{N}{k}$ > about 20,

$$d = \frac{N}{k}\left(0.423 + k - \log_e N + \log_e k\right) \qquad (3)$$
Or,
$$k\left(1 - \frac{d}{N}\right) = \log_e N - \log_e k - 0.423 \qquad (4)$$
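The two forms of the diversity curve can be compared numerically. The following is a minimal sketch, assuming that N/k is truncated to a whole number for the explicit sum in equation (2); the function names are illustrative and not part of the original paper.

```python
import math


def different_words_sum(n_words, k):
    """Equation (2): d = N/k + N - sum_{r=1}^{N/k} N/(k r).

    Assumption: the upper limit N/k is truncated to an integer.
    """
    upper = int(n_words / k)
    harmonic = sum(1.0 / r for r in range(1, upper + 1))
    return n_words / k + n_words - (n_words / k) * harmonic


def different_words(n_words, k):
    """Equation (3): d = (N/k) * (0.423 + k - ln N + ln k)."""
    return (n_words / k) * (0.423 + k - math.log(n_words) + math.log(k))


if __name__ == "__main__":
    for n_words in (200, 1000, 8000):
        print(n_words,
              round(different_words_sum(n_words, 8.76), 1),
              round(different_words(n_words, 8.76), 1))
```

For k = 8.76 the closed form gives roughly 138 different words at N = 200 and roughly 507 at N = 1,000, in line with the theoretical column of Table I.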


By substituting the appropriate values of d and N in any
given case from empirical data, equation (4) may be solved
for k, the index of diversity, if the harmonic series is known
to apply. A test of equation (3) is afforded by plotting its
curve when k has been found in a particular sample, and by
comparing the observed with the theoretical points. This
has been done in the case of material which the writer has
collected.** The mean number of different words per 200
words of this material was 137.8, with a standard deviation
of 10.0. In order to make the points on the empirical
diversity curve comparable, the blocks of 200 words were
successively pooled with the cumulated sample in such a way that
the mean d per 200 words was kept at approximately 137.8,
the value obtained from the total sample. For this material
k was found to be 8.76, where N = 8,000 and d = 2,155 for
the total sample. The observed points and the theoretical
diversity curve are plotted in Fig. 1. The virtual agreement
between the two sets of values is shown numerically in Table
I. This table also gives the observed values for a short
word-count of Santayana’s Last Puritan. It happens that these
values are very close to those of the writer’s material and to
those calculated from the equation. The Last Puritan
material was pooled in successive blocks of 200 words in the
order in which they stood in the original text.


* Even though not all ranks are whole numbers.


** This sample was obtained by asking college students to fill in
patterns of capital letters and asterisks with words of their own choice. For
example, * Y F * might have elicited the response, Is your father sick?
This method of collecting latent speech is somewhat analogous to the
method of Skinner’s verbal summator (8).
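Equation (4) has k on both sides, so it must be solved numerically. The sketch below is one possible approach, assuming simple bisection on equation (3); the function names, the bracketing interval, and the tolerance are illustrative choices and not the writer's own procedure.

```python
import math


def different_words(n_words, k):
    """Equation (3): d = (N/k) * (0.423 + k - ln N + ln k)."""
    return (n_words / k) * (0.423 + k - math.log(n_words) + math.log(k))


def solve_k(n_words, d, lo=1.0, hi=None, tol=1e-9):
    """Find k such that equation (3) reproduces the observed d.

    Assumption: for fixed N, d rises with k throughout the bracket,
    so bisection converges to the single root inside it.
    """
    if hi is None:
        hi = n_words / 2.0
    if not different_words(n_words, lo) < d < different_words(n_words, hi):
        raise ValueError("observed d is not bracketed by the search interval")
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if different_words(n_words, mid) < d:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


if __name__ == "__main__":
    # Totals reported for the writer's material: N = 8,000, d = 2,155.
    print(round(solve_k(8000, 2155), 2))
```

Applied to the totals above, this procedure returns a value close to the reported k = 8.76; small differences are to be expected from rounding in the constants.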


Equation (3), of course, only applies where $f r^{1} = K$ for
most values of r. The writer’s material yields this equation
satisfactorily, except for that part of the distribution where
r has values of 1 to about 30. This discrepancy, however,
does not seem to invalidate equation (3).*


TABLE I


OBSERVED AND THEORETICAL VALUES OF d FOR THE
WRITER’S LETTER-STAR MATERIAL, WITH SEVERAL
VALUES OF d FROM SANTAYANA’S LAST PURITAN


    N     Theoret. d    Letter-Star   Santayana
          (k = 8.76)         d            d
   200      138.2           138          145
   400      245             250          252
   600      339             345          342
   800      426             431          423
  1000      507             516          513
  1200      583             590          590
  1400      655             667          668
  1600      725             738          735
  1800      790             793          816
  2000      854             847          886
  3000     1144            1142
  4000     1392            1394
  5000     1616            1612
  6000     1809            1807
  7000     1990            1992
  8000    (2155)           2155


The diversities of other samples which are known to
follow the simple harmonic series law may easily be calculated.
Table II gives N, d, and k for several such samples where the
data are available. The data from Joyce’s Ulysses are
included with hesitation since the exponent of r is about 1.07.


* It has been suggested (9) that, in the distribution of associated
words, for a discrepancy of this sort in the most frequent words there
may exist a compensation in the form of an upward shift of the rest of
the harmonic series curve without change of slope. This explanation
may also hold here.
TABLE II

DIVERSITIES OF SEVERAL PUBLISHED SAMPLES


                              N          d        k
Writer’s Material           8,000      2,155     8.76
Verbal Summator (8)         3,046        806     7.60
Eldridge (3)               43,989      6,002     9.3
Dewey (2)                 100,000     10,161     9.8
Joyce’s Ulysses (4)       260,430     29,899   [10.9]


An important consequence of equation (3) is that the
harmonic series law will not apply to very large samples unless
an extremely high diversity exists. The maximum value of
d for any value of k is where $\sum_{r=1}^{N/k} \frac{1}{kr} = 1$, i.e., where the first
$\frac{N}{k}$ words exhaust the total sample. The higher the diversity,
the larger the sample for which $fr = K$ can hold. For k = 10,
the maximum value of d = about 12,100. Beyond the
maximum value of d the harmonic series may still hold for a
part of the distribution, but a compensation must exist
either as a diminution of the frequencies of the commonest
words or as a change in the exponent of r from the “normal”
value of 1.00. Further study of the (probably independent)
variation of x and k in $f r^x = \frac{N}{k}$ is needed. Unfortunately,
such study is hindered by the lack of adequate tables to
accomplish the summation of series of the type $\sum \frac{1}{r^x}$ where x
is not integral.

When curves of different values of k are plotted on
coordinates of d and N (Fig. 2), there are large areas where points
could not occur on the ascending portion of any diversity
curve, but could only occur on a portion of a curve as it
descends towards a zero value of d. This would be true, for
example, of the Thorndike sample (10) where N = 4,500,000;
d = approximately 55,000; k = 12.5; also of the Brandenburg
4-year-old child G sample (6) where N = 14,930; d = 999; k
= 7.67. These values of k are meaningless, for wherever N is
larger than its value at the maximum value of d (here
N = dk), d is equal to or smaller than its maximum value,
and any point will be on the descending portion of a curve
whose ascending portion is not truly characteristic of the
sample. The simple harmonic series law, at least, cannot
hold for such samples.

FIG. 2. Theoretical curves of equation (3), plotted for several values of k.
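Whether a given sample lies past the peak of its diversity curve can be checked directly. The sketch below assumes the same approximation used for equation (3), so that the condition for the maximum, $\sum 1/(kr) = 1$, becomes $0.577 + \log_e (N/k) = k$; the function names are illustrative and not from the original.

```python
import math


def n_at_maximum(k):
    """Sample size N at which d is greatest, from 0.577 + ln(N/k) = k."""
    return k * math.exp(k - 0.577)


def on_descending_branch(n_words, k):
    """True if N exceeds the sample size at which the curve for k peaks."""
    return n_words > n_at_maximum(k)


if __name__ == "__main__":
    samples = {
        "Thorndike": (4_500_000, 12.5),
        "Brandenburg child G": (14_930, 7.67),
        "Writer's material": (8_000, 8.76),
    }
    for name, (n_words, k) in samples.items():
        print(name, on_descending_branch(n_words, k))
```

Under this approximation the Thorndike and Brandenburg samples both fall on the descending branch, while the writer's 8,000-word sample lies well before the peak of its curve.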

Though the diversity equation derived here cannot hold 
when the curve is extended out to large values of N, there 
is a suggestion that the diversity of a small sample has a 
definite relation to the total vocabulary represented by the 
sample, since if we assume the validity of the harmonic 
series law there is a maximum value of d for a given k, as 
has been already indicated. The study of vocabulary size 
would possibly be facilitated by the application of a measure 
of diversity such as has been suggested here, together with 
tables of the expected total vocabulary associated with
different values of k, the tables having been constructed as a
result of both theoretical and empirical investigations. A
test of individual vocabulary might possibly be devised on
this basis. An index of diversity might also be used to
differentiate linguistic materials with respect to stylistic or
other characteristics.